Diagnostic performance of deep learning in ultrasound diagnosis of breast cancer: a systematic review

Deep learning (DL) has been widely investigated in breast ultrasound (US) for distinguishing between benign and malignant breast masses. This systematic review of diagnostic test accuracy aims to examine the accuracy of DL, compared to human readers, for the diagnosis of breast cancer with US in clinical settings. Our literature search covered four databases: PubMed, Embase, Scopus, and Cochrane Library. Test accuracy outcomes were synthesized to compare the diagnostic performance of DL and human readers as well as to evaluate the assistive role of DL to human readers. A total of 16 studies involving 9238 female participants were included. There were no prospective studies comparing the test accuracy of DL versus human readers in clinical workflows. Diagnostic test results varied across the included studies. In 14 studies employing standalone DL systems, DL showed significantly lower sensitivities in 5 studies with comparable specificities and outperformed human readers at higher specificities in another 4 studies; in the remaining studies, DL models and human readers showed equivalent test outcomes. In 12 studies that assessed assistive DL systems, no study demonstrated that DL assistance improved the overall diagnostic performance of human readers. Current evidence is insufficient to conclude that DL outperforms human readers or enhances the accuracy of diagnostic breast US in a clinical setting. Standardization of study methodologies is required to improve the reproducibility and generalizability of DL research, which will aid in clinical translation and application.


INTRODUCTION
Breast cancer is the world's most prevalent cancer and remains the major cause of cancer-associated deaths globally. GLOBOCAN estimated that in 2020, about 2.3 million women were diagnosed with breast cancer and there were 685,000 breast cancer-associated deaths worldwide 1. Early and accurate diagnosis results in better patient outcomes. Breast ultrasound (US) is low-cost, easy to operate, radiation-free, portable, and typically helpful for distinguishing between a cystic and a solid breast mass. The effectiveness of US as a diagnostic tool for palpable breast abnormalities is widely recognized, especially in cases involving dense breast tissue or mammographically occult lesions [2][3][4]. Additionally, US is considered the preferred imaging method for guiding breast biopsy procedures 5,6. However, the diagnostic efficacy and reproducibility of US examinations are relatively low because they depend on the knowledge and experience of the operators 7,8.
Deep learning (DL), an innovative artificial intelligence (AI) technology, excels at image-related tasks, including abnormality detection, segmentation, and classification (Fig. 1). The integration of DL into the US imaging workflow offers numerous benefits, including improved efficiency, reduced errors, and automated quantitative assessments 9. Consequently, significant efforts have been made to facilitate the clinical application of DL in medical imaging. For instance, the DL-based ultrasonography system known as S-Detect (Samsung Medison, Seoul, Korea) has gained increasing popularity for breast cancer diagnosis. This system enables automatic segmentation and interpretation of US morphological descriptions, providing a dichotomous classification (possibly benign or possibly malignant) that serves as a reference for radiologists during the final diagnostic process 10.
Several recent reports have suggested that DL-based interpretation of breast US is on par with or even superior to that of a human radiologist [11][12][13][14][15]. However, the application of DL in clinical practice remains controversial and results vary across studies. Current reviews 10,16 focused on evaluating the application potential of commercial products, such as S-Detect. There is a paucity of evidence-based systematic reviews specific to the general diagnostic performance of DL models in the clinical practice of breast US, in particular a comprehensive comparison between DL and human readers. Our work aims to assess current evidence on the diagnostic performance of DL algorithms in the detection and classification of breast lesions in clinical US tests, including (1) whether standalone DL systems outperform radiologists in breast cancer diagnosis and (2) whether assistive DL systems can improve diagnostic performance when used in concert with human radiologists.

Diagnostic performance comparison
DL can function either as a standalone system, where the algorithms independently generate diagnostic decisions, or as an assistant to radiologists, where the final diagnosis is made by radiologists considering the DL outcomes. Consequently, the development of a successful DL product necessitates not only the construction of robust DL algorithms but also the exploration of how the algorithm outputs can enhance radiologists' diagnostic capabilities. It is crucial to investigate the usefulness of DL outputs for radiologists, quantify the benefits of DL in patient care, and determine strategies to optimize these advantages.
In the test accuracy comparison between DL systems and human readers, 4 studies evaluated the diagnostic performance of DL systems as standalone systems 19,22,26,31, 2 studies employed assistive DL systems 17,21, and another 10 studies assessed the roles of DL systems as both standalone and assistive systems 18,20,[23][24][25][27][28][29][30]32. Those studies employed human readers with various levels of clinical experience in breast US and investigated the performance of DL systems compared to experienced and less experienced human readers.

Standalone DL systems
In 14 studies using DL as a standalone system, the diagnostic accuracy of DL and human readers was compared (Table 2). A study 20 conducted by Cho et al. found that DL had a lower AUC than human readers. Two studies 22,24 showed that DL was equivalent to human readers in AUC. In contrast, another study 32 reported a higher AUC for DL than for human readers. More specifically, in three studies 19,24,29, DL had a superior AUC to less experienced human readers while being comparable to experienced human readers. As for accuracy, DL systems were more accurate than all human readers in two studies 24,32. Wei et al. 29 reported that DL was more accurate than less experienced human readers while comparable to experienced human readers. In contrast, another study showed that DL was equivalent to less experienced human readers while more accurate than experienced human readers. In addition, standalone DL had lower sensitivity than human readers overall in five studies 19,20,24,30,32. Another two studies 26,28 found that DL was more sensitive than less experienced human readers but less sensitive than experienced human readers. In four studies 19,20,24,32, DL exhibited higher specificity than human readers overall. In another study 26, DL was more specific than less experienced human readers but less specific than experienced human readers. The remaining studies did not report comparable diagnostic measures between DL systems and human readers.

Assistive DL systems
In 12 studies that assessed assistive DL systems (Table 2), three studies 18,27,32 reported an improved AUC for human readers when combined with DL systems. Another study 20 showed that assistive DL had an AUC comparable to that of human readers alone. Regarding the assistive effects of DL on human readers with different levels of experience, two studies 17,24 found that assistive DL systems had a higher AUC than less experienced human readers, but this benefit did not extend to experienced human readers. In accuracy tests, assistive DL systems were more accurate than human readers in three studies 20,24,32. However, no studies showed improved overall sensitivity of the combination of DL and human readers compared to human readers alone. One study 28 reported improved sensitivity of an assistive DL system compared to less experienced human readers, but this advantage was not maintained when the system was used by experienced human readers. Improved specificity in human readers overall was reported in seven studies 18,20,21,24,27,28,32 that used assistive DL systems. Interestingly, in a study 17 reported by Park and coworkers, the assistive DL technology improved diagnostic specificity among experienced human readers but not among inexperienced readers. In another study 20, by contrast, less experienced human readers were aided in terms of specificity by the assistive DL system.
In Fig. 3, we estimated the sensitivity and specificity of DL systems and average human readers. We tentatively infer that both standalone and assistive DL systems are more specific than average human readers, while it remains unclear whether they are more sensitive. However, complete 2 × 2 contingency tables were not available in most studies, so we were unable to conduct a thorough diagnostic analysis for all included studies.
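To make concrete why complete 2 × 2 contingency tables matter, the following is a minimal Python sketch of how sensitivity, specificity, and accuracy follow from the four cell counts. The class name and all counts are hypothetical illustrations, not data from any included study, and this is not the analysis code used in the review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContingencyTable:
    """Counts from a 2 x 2 table of index test result vs. reference standard."""

    tp: int  # malignant lesions classified as malignant
    fp: int  # benign lesions classified as malignant
    fn: int  # malignant lesions classified as benign
    tn: int  # benign lesions classified as benign

    def sensitivity(self) -> float:
        # Share of malignant lesions that the test flags as malignant.
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        # Share of benign lesions that the test flags as benign.
        return self.tn / (self.tn + self.fp)

    def accuracy(self) -> float:
        # Share of all lesions classified correctly.
        return (self.tp + self.tn) / (self.tp + self.fp + self.fn + self.tn)


# Hypothetical counts for a human reader and a DL system on the same
# 200 lesions (80 malignant, 120 benign).
reader = ContingencyTable(tp=72, fp=40, fn=8, tn=80)
dl_system = ContingencyTable(tp=64, fp=18, fn=16, tn=102)

for name, table in (("reader", reader), ("DL", dl_system)):
    print(
        f"{name}: sensitivity={table.sensitivity():.3f} "
        f"specificity={table.specificity():.3f} "
        f"accuracy={table.accuracy():.3f}"
    )
```

In this made-up example the DL system is less sensitive but more specific than the reader, the kind of trade-off that cannot be quantified when a study reports only one or two of these measures without the underlying counts.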

Quality assessment
Based on the QUADAS-2 and QUADAS-C tools, we tailored the signalling questions in four domains (patient selection, index test, reference standard, and flow and timing) to assess the quality and applicability of included studies (Supplementary Table 5). The studies with low, high, or unclear risk of bias and applicability concerns are summarized in Table 3 and Figs. 4 and 5. Most studies showed a high risk of bias in the four domains. For example, the average cancer prevalence of included lesions was 39.5%, ranging from 6% to 64.7% (Supplementary Table 4 and Supplementary Fig. 1), which far exceeds the prevalence in screening and diagnostic settings 33. This led to a high risk of bias in patient selection. Additionally, most study designs did not represent a complete US testing pathway applicable to clinical practice. For example, DL systems were used for image reading but were not integrated into clinical decisions, such as diagnosis, further tests, or follow-up. Instead, the choice of patient management (e.g., biopsy, follow-up) to confirm disease status was based on the decision of the human readers rather than on standalone or assistive DL systems. Meanwhile, for human readers, the testing pathway also did not reflect clinical routine, where they have access to the patient's clinical information as well as prior US images. The reference standards varied among the 16 included studies, of which 4 studies 17,22,25,28 were at high risk of bias because the follow-up time of women with negative tests was <2 years, which is shorter than the recommended follow-up interval 33 and therefore may underestimate the rate of missed cancers and overestimate diagnostic accuracy.
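One reason an enriched case mix limits applicability can be illustrated with a textbook calculation: at fixed sensitivity and specificity, the positive and negative predictive values depend on prevalence. The sketch below is an illustration only and not part of the review's analysis; the sensitivity and specificity values and the 5% comparison prevalence are hypothetical, while 39.5% is the average prevalence reported above.

```python
def predictive_values(sensitivity: float, specificity: float, prevalence: float):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    tp = sensitivity * prevalence              # true positives per unit population
    fp = (1 - specificity) * (1 - prevalence)  # false positives
    fn = (1 - sensitivity) * prevalence        # false negatives
    tn = specificity * (1 - prevalence)        # true negatives
    return tp / (tp + fp), tn / (tn + fn)


# Hypothetical test characteristics, held constant across both settings.
SENSITIVITY = 0.90
SPECIFICITY = 0.80

# 0.395 is the average prevalence in the included studies; 0.05 is an
# assumed lower prevalence chosen only for contrast.
ppv_enriched, npv_enriched = predictive_values(SENSITIVITY, SPECIFICITY, 0.395)
ppv_low, npv_low = predictive_values(SENSITIVITY, SPECIFICITY, 0.05)

print(f"prevalence 39.5%: PPV={ppv_enriched:.3f}, NPV={npv_enriched:.3f}")
print(f"prevalence  5.0%: PPV={ppv_low:.3f}, NPV={npv_low:.3f}")
```

With these assumed inputs the positive predictive value drops sharply in the lower-prevalence setting, so a positive call that looks reliable in an enriched study sample would correspond to many more false positives in routine practice.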

DISCUSSION
This review presents a comprehensive overview of the diagnostic performance of DL systems in breast US, where they serve as standalone systems or as aids to human readers. We identified 16 studies that compared the test accuracy measures of a commercial or in-house DL system to those of human readers. Diagnostic test outcomes varied substantially among the included studies. While we cautiously inferred that DL systems were more specific than average human readers, which might help decrease false positives, no consensus on AUC, accuracy, or sensitivity was found for either standalone or assistive DL systems. Importantly, one of the main concerns in DL studies is that better imaging sensitivity might come at the cost of increased false positives and vice versa. Critical performance metrics such as AUC, accuracy, sensitivity, specificity, true positive, false positive, false negative, and true negative should be considered together. However, not all included studies reported these diagnostic measures. Although most of the included studies (14/16) used FDA-approved DL systems, the clinical effects of DL systems in standalone or assistive roles have not yet been fully revealed because of the lack of generalizable reporting and good study design. Therefore, our systematic review disagrees with findings from various publications, some of which have claimed that DL systems (e.g., S-Detect) outperform humans 18,20,24 and have a significant role in assisting human readers in distinguishing between benign and malignant breast masses 10,16. This does not necessarily mean that the DL algorithm in breast US itself is unreliable. Rather, it points to directions for future improvement of this promising technology.
Our review found high heterogeneity stemming from study designs, methods, targeted populations, diagnostic measures, and human readers' experience, which hinders the comparability of evidence across included studies. There was a wide variation in the number and pathological type of selected lesions. Thirteen studies evaluated fewer than 500 women, while the outcomes of another three studies were based on many more participants. Promising results from small populations may not be applicable to larger populations. In addition, the malignant proportions far exceed the cancer prevalence in the real world, which inevitably overestimates the sensitivity. Importantly, most of the included studies originated in Asia, and mostly at a single site, which may affect the external validity of reported results. Furthermore, compared with Caucasian women, Asian women generally have denser breasts and a younger age of onset of breast cancer. Discrepancies in race and ethnicity make it difficult to extrapolate the positive findings among Asian participants to multi-race and multi-ethnic populations. Hence, multicenter studies from different countries that recruit participants from multiple races and ethnicities are required to achieve higher applicability of these studies. Additionally, the test cutoff values varied among studies, with some using BIRADS-4a and others using BIRADS-4b as the threshold for classifying malignancies. In this regard, test bias could have been introduced. These studies also used various definitions of experienced and less experienced human readers, which might lead to contrary conclusions among some studies. Furthermore, the included studies show some variation in reference standards, including pathological confirmation and follow-up time (7-35 months). The methods for obtaining pathological results were also inconsistent, including histopathologic results from US-guided biopsy, vacuum-assisted excision, or open surgery. These discrepancies suggest that accuracy evaluations are not comparable among studies. Overall, the current evidence base is not of sufficient quality to support a broad clinical practice recommendation of DL systems in breast US.
Furthermore, compared to other medical imaging modalities, such as MRI, DL-assisted US shows intrinsic limitations, which hinder its clinical applicability. For example, US imaging is dependent on its operators, resulting in high intra- and interobserver variability in image acquisition and interpretation. Moreover, unlike MRI images, which show the whole lesion range, still US images are obtained from parts of targeted organs, which may cause under-representation or over-exaggeration. Additionally, US technology has been evolving fast over recent decades. Older ultrasonograms are generally of lower resolution and higher noise, while up-to-date images are of higher resolution and lower noise. Thus, DL models that are trained with older images may not be externally valid for images acquired by advanced devices. Careful methodological consideration is needed to draw generalizable conclusions from DL studies in US technology.
In this systematic review, we followed an established methodology and stringent inclusion criteria and tailored the quality assessment tools for included studies. Our emphasis on comparisons with the diagnostic performance of humans in clinical practice may explain why our conclusions are more cautious than many of the papers we reviewed herein. Importantly, according to previous studies and the current guidelines, internal validation, where training and validation are performed on the same dataset (for example, cross-validation), tends to overestimate accuracy and has limited generalizability because of overfitting 33. Hence, at the initial stage of literature identification, only studies using external validation of test sets were included. Therefore, our work can provide a purposeful insight into the role of DL in the US diagnosis of breast cancer. However, this systematic review excluded non-English publications, which might introduce selection bias. In addition, we were unable to calculate comprehensive diagnostic measures because accuracy, true positive, false positive, true negative, false negative, and statistical difference (or the raw data to calculate them) were not reported.
To ensure reproducibility and generalizability of the results of this promising technology, we recommend developing standardized DL research guidelines for further investigations. Aligned study designs, agreed-upon benchmarking data sets, complete performance metrics, standard imaging protocols and reporting formats, and consistent cutoff values and reference standards will help decrease heterogeneity and bias. Furthermore, multicenter studies are much needed to determine the diagnostic accuracy of DL products. Prospective, randomized controlled trials that are applicable to clinical testing pathways are particularly important for examining DL's role in a clinical environment. Also, we need to identify the DL products with the best performance in terms of accuracy, efficiency, availability, cost-effectiveness, and safety to improve clinical workflows. DL-based breast US diagnosis is still in its infancy, and considerable efforts are needed to realize its positive impacts on radiologists and patients.

Protocol and registration
This systematic review was conducted following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses of Diagnostic Test Accuracy (PRISMA-DTA) statement 34. Our review protocol was registered on the International Prospective Register of Systematic Reviews (PROSPERO: CRD42022349609).

Literature search
Literature searches were conducted by two librarians (H.B. and J.B.) to identify relevant studies published in English from four databases: PubMed, Embase, Scopus, and Cochrane Library. The publication time of studies was set from inception to 18 January 2023. The literature search was performed based on five themes: breast cancer, US, AI, accuracy, and diagnostic. The search keywords and strategies are shown in Supplementary Tables 6 and 7.

Study selection
Two reviewers (Q.D. and Z.X.) independently reviewed the titles and abstracts of all retrieved records for further identification according to the inclusion and exclusion criteria. Subsequently, the identified publications were screened by reviewing the full texts for final inclusion. Any discrepancies were resolved through discussion to reach a final consensus.
We applied rigorous inclusion and exclusion criteria to evaluate the integration of DL into clinical breast cancer diagnosis using US. We included studies that focused on: (1) evaluating DL algorithms for breast cancer diagnosis using US; (2) assessing the test accuracy of DL algorithms for breast lesion diagnosis using US; and (3) utilizing histologically confirmed and/or follow-up reference standards. We excluded studies that: (1) did not compare the diagnostic performance of DL algorithms to that of human readers; (2) lacked external validation; (3) did not employ DL algorithms (e.g., utilizing traditional AI without binary classification or final decision); (4) solely focused on detecting specific cancer subtypes (e.g., ductal or lobular carcinoma) rather than overall diagnostic accuracy; (5) did not report diagnostic metrics beyond the area under the receiver operating characteristic curve (AUC); (6) involved participants under the age of 18; (7) included participants with implants, lactation, prior known breast cancer, or prior breast treatments such as surgery, radiation therapy, and chemotherapy; or (8) enrolled male patients.
Fig. 5 Graphic display of QUADAS-2 and QUADAS-C for studies using assistive DL systems. The proportion of studies with low, high, or unclear risk of bias and concerns regarding applicability.

Data extraction
Study characteristics and test accuracy outcomes were independently extracted by two reviewers (Q.D. and Z.X.) from all included studies. Any disagreements were resolved by discussion. Extracted study characteristics included study design, population, US device vendors, dataset characteristics (training/validation/testing set), descriptions of the DL algorithms, descriptions of the human readers, reference standards, and any other pertinent information. Test performance characteristics included accuracy, AUC, sensitivity, and specificity.
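Of the extracted measures, AUC is the only one that does not depend on a single cutoff. As a reminder of what it summarizes, the following minimal Python sketch computes AUC as the probability that a randomly chosen malignant lesion receives a higher score than a randomly chosen benign one, with ties counted as one half. The function name and the scores are hypothetical, and this is not how any included study or this review computed AUC.

```python
def auc_from_scores(malignant_scores, benign_scores):
    """Pairwise (Mann-Whitney style) estimate of the area under the ROC curve."""
    wins = 0.0
    for m in malignant_scores:
        for b in benign_scores:
            if m > b:
                wins += 1.0   # malignant lesion ranked above benign lesion
            elif m == b:
                wins += 0.5   # tie counts as half a win
    return wins / (len(malignant_scores) * len(benign_scores))


# Hypothetical malignancy scores in [0, 1] for a handful of lesions.
malignant = [0.9, 0.8, 0.4]
benign = [0.7, 0.3, 0.2, 0.4]

auc = auc_from_scores(malignant, benign)
print(f"AUC = {auc:.3f}")  # 10.5 favourable pairs out of 12
```

The quadratic loop is chosen for readability; for real datasets a rank-based implementation from an established statistics library would be the usual choice.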

Quality assessment
Two reviewers (Q.D. and Z.X.) independently assessed the quality of the selected studies using the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) and QUADAS-C tools, tailored to our review questions based on a breast US test pathway applicable to clinical settings (Supplementary Table 5). For risk of bias, patient selection, index tests, reference standards, and flow and timing were assessed. For applicability concerns, patient selection, index test, and reference standards were assessed. Any disagreements were resolved by discussion.

Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Fig. 1 Schematic illustration of clinical US examination workflow and the image-related task where DL-based system could have a large impact. a Clinical US workflow comprises image acquisition, image analysis (which may involve DL), report generation, and further procedures based on diagnostic reports. b A DL system comprises multiple layers where feature extraction, selection, and ultimate classification are performed simultaneously during training. US images as input are analyzed and the DL model gives binary classification (benign or malignant). Final assessment is made based on the decision of the DL system alone or in combination with human radiologists.

Fig. 2 PRISMA diagram of included and excluded studies at each stage of the review. Sixteen publications from the searched databases (PubMed, Embase, Scopus, and Cochrane Library) were included after removing duplicates, irrelevant studies, and studies that did not meet the inclusion criteria.

a Category 4a as the cut-off value.
b Category 4b as the cut-off value.
c If both the assessments of longitudinal and transverse sections from the DL model were possibly benign, the final BIRADS category would be downgraded.
d If any of the assessments from DL were possibly benign, the final BIRADS category would be downgraded.
e Sequential reading mode.
f Simultaneous reading mode.
g If the DL model assessed the lesion as malignant or benign, the final BIRADS classification would be upgraded or downgraded by one level.
h The BIRADS assessment was flexibly adjusted by human readers after combining DL's outcomes.
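The combination rules and cut-off values described in these footnotes can be sketched in a few lines of Python. This is an illustrative reading only, not the logic of any commercial system or included study: the function names are hypothetical, the category list is a simplified ordering of BI-RADS assessment categories, and the one-level size of the downgrade in rules c and d and the clamping at the ends of the scale are assumptions, since the footnotes state a one-level change only for rule g.

```python
# Simplified, ordered BI-RADS assessment categories (assumed for illustration).
BIRADS_ORDER = ["2", "3", "4a", "4b", "4c", "5"]


def shift_category(category: str, steps: int) -> str:
    """Move up (+) or down (-) the ordered scale, clamped at both ends."""
    idx = BIRADS_ORDER.index(category) + steps
    idx = max(0, min(idx, len(BIRADS_ORDER) - 1))
    return BIRADS_ORDER[idx]


def assist_both_views(category: str, dl_longitudinal: str, dl_transverse: str) -> str:
    """Rule c: downgrade only if DL calls both sections possibly benign."""
    if dl_longitudinal == "benign" and dl_transverse == "benign":
        return shift_category(category, -1)  # one-level step is an assumption
    return category


def assist_any_view(category: str, dl_longitudinal: str, dl_transverse: str) -> str:
    """Rule d: downgrade if DL calls either section possibly benign."""
    if "benign" in (dl_longitudinal, dl_transverse):
        return shift_category(category, -1)  # one-level step is an assumption
    return category


def assist_symmetric(category: str, dl_call: str) -> str:
    """Rule g: upgrade on a malignant call, downgrade on a benign call."""
    return shift_category(category, 1 if dl_call == "malignant" else -1)


def is_positive(category: str, cutoff: str = "4a") -> bool:
    """Dichotomize: categories at or above the cut-off count as test-positive."""
    return BIRADS_ORDER.index(category) >= BIRADS_ORDER.index(cutoff)


# A reader's 4a assessment combined with discordant DL calls on two sections.
print(assist_both_views("4a", "benign", "malignant"))  # unchanged under rule c
print(assist_any_view("4a", "benign", "malignant"))    # downgraded under rule d
print(is_positive("4a", cutoff="4a"), is_positive("4a", cutoff="4b"))
```

The same reader assessment and the same DL output can therefore end up test-positive or test-negative depending on which combination rule and which cut-off a study adopts, which is one source of the heterogeneity between studies.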

Fig. 3 Estimated sensitivity and specificity of standalone/assistive DL systems and human readers. a Sensitivities of standalone DL systems and average human readers. b Specificities of standalone DL systems and average human readers. c Sensitivities of assistive DL systems and average human readers. d Specificities of assistive DL systems and average human readers. Error bars represent SD.

Table 1. Characteristics of 14 studies using standalone DL systems and 12 studies using assistive DL systems.

Table 2. Test outcomes of standalone and assistive DL systems.

Acc accuracy, Sen sensitivity, Spe specificity, NS not significant, NA not applicable.